Accurate Prediction of y Ions in Beam-Type Collision-Induced Dissociation Using Deep Learning

Peptide fragmentation spectra contain critical information for the identification of peptides by mass spectrometry. In this study, we developed an algorithm that more accurately predicts the high-intensity peaks in peptide spectra. The training data comprise 180,833 peptides from the National Institute of Standards and Technology and Proteomics Identification databases, which were fragmented by either quadrupole time-of-flight or triple-quadrupole collision-induced dissociation. Exploratory analysis of the peptide fragmentation patterns focused on the highest-intensity peaks and showed that proline, peptide length, and sliding-window combinations of four amino acids can be exploited as key features. The amino acid sequence of each peptide and each of the key features were allocated to different layers of the model, which uses recurrent, convolutional, and fully connected neural networks. The trained model, PrAI-frag, predicts fragmentation spectra more accurately than previous machine learning-based prediction algorithms. The model excels at high-intensity peak prediction, which is advantageous for selective/multiple reaction monitoring applications. PrAI-frag is provided via a Web server that can be used for peptides of length 6–15.


Supplementary Data S2 Detailed training model structure.
Inputs. The training model takes six inputs: the one-hot encoded amino acid sequence of the peptide, collision energy (CE), charge, length, the number of proline residues, and a sliding window of 4-mers. The six inputs were either calculated or converted from the peptide amino acid sequences, and they were fed into the model through three different types of layers. The recurrent neural network (RNN) layer takes the embedded one-hot encoded amino acid sequence to learn sequential information. The fully connected network (FCN) layer takes four inputs (CE, charge, length, and the number of proline residues), hereafter called feature_group1. The information in feature_group1 was fed into the model via a single fully connected layer because sequential information was not crucial for these features. The convolutional neural network (CNN) layer takes the sliding window of 4-mers, which contains the consecutive overlapping 4-mers of each peptide. For example, the peptide VAGAAVAK is transformed into VAGA, AGAA, GAAV, AAVA, and AVAK. The sliding window for VAGAAVAK thus produces five 4-mers, which equals the peptide length minus 3. For a peptide of length 15, twelve 4-mers are generated. To feed this information into the CNN layers, the sliding window was represented as a 12 X 4 two-dimensional (2D) matrix. This shape is equivalent to single-channel image data with 12 X 4 pixels, which enables a CNN implementation.
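The sliding-window transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the padding symbol, and the choice to pad shorter peptides up to 12 rows are assumptions, and the text does not state how residues in the matrix are numerically encoded.

```python
def sliding_4mers(peptide, k=4):
    """Return the consecutive overlapping k-mers of a peptide sequence."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def to_window_matrix(peptide, max_len=15, k=4, pad="-"):
    """Arrange the k-mers as a (max_len - k + 1) x k grid of residue symbols.

    Rows beyond the peptide's own windows are filled with a padding symbol.
    The padding scheme is an assumption; it is not specified in the text.
    """
    rows = [list(window) for window in sliding_4mers(peptide, k)]
    n_rows = max_len - k + 1  # 12 rows for peptides of up to 15 residues
    rows += [[pad] * k for _ in range(n_rows - len(rows))]
    return rows


windows = sliding_4mers("VAGAAVAK")
matrix = to_window_matrix("VAGAAVAK")
print(windows)                      # the five 4-mers of the example peptide
print(len(matrix), len(matrix[0]))  # grid dimensions
```

For VAGAAVAK the first five rows hold the five 4-mers and the remaining seven rows hold padding, so every peptide of length 6–15 maps to the same 12 X 4 shape.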
Model architecture. The model takes three different types of data into separate RNN, FCN, and CNN layers, whose outputs are then combined and decoded by a second RNN layer. The one-hot encoded sequences were fed into a bidirectional gated recurrent unit (GRU) layer with a hidden size of 128, whose hidden states were saved and forwarded to the second GRU layer described later. The GRU layer output was subsequently forwarded to a dropout layer (p=0.4), a leaky rectified linear unit (ReLU) with a gradient of 0.3, and an attention mechanism layer with dropout (p=0.1). The output after dropout, with a feature size of 256, is later multiplied by the output matrix concatenated from the feature_group1 FCN layer and the sliding window CNN layers.
The sliding window of 4-mers, input as a 12 X 4 matrix, was forwarded to a 2D convolution layer with a kernel size of 1 X 4, a stride of 1, and a ReLU function, yielding a 12 X 1 matrix with 48 channels. The output was subsequently forwarded to the second 2D convolution layer with a kernel size of 2 X 1, a stride of 1, and a leaky ReLU function with a gradient of 0.3, yielding an 11 X 1 matrix with 128 channels. The output was forwarded to the third convolution layer with a kernel size of 11 X 1, a stride of 1, and dropout (p=0.4), yielding a 1 X 1 matrix with 14 channels. The output from the 2D convolution layers was flattened to a feature size of 224. The output for the sliding window of 4-mers was concatenated with the output of a single FCN layer that maps the 4 features of feature_group1 to a feature size of 32. The concatenated output from the sliding window of 4-mers and feature_group1 has a feature size of 256; it was multiplied by the output from the first GRU layer, which also has a feature size of 256, and fed to the second GRU layer.
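The spatial sizes quoted for the three convolution layers follow from the standard output-size rule for a convolution, (input - kernel) / stride + 1 per dimension. The short check below assumes no padding and no dilation, neither of which is stated in the text; the helper name is hypothetical. It verifies the spatial dimensions and the concatenated feature size only, and takes the channel counts and the flattened size of 224 as given.

```python
def conv2d_out(size, kernel, stride=1):
    """Output (height, width) of a 2D convolution without padding or dilation."""
    return tuple((s - k) // stride + 1 for s, k in zip(size, kernel))


shape = (12, 4)  # sliding-window input matrix
shapes = []
for kernel in [(1, 4), (2, 1), (11, 1)]:  # kernels of the three conv layers
    shape = conv2d_out(shape, kernel)
    shapes.append(shape)

print(shapes)  # spatial size after each convolution layer

# Flattened CNN output (224, as stated) plus the feature_group1 FCN output (32)
concat_size = 224 + 32
print(concat_size)
```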
The second GRU layer takes the multiplied features and the hidden states from the first GRU layer. Its output was forwarded to a dropout layer (p=0.4) and a second attention mechanism layer with leaky ReLU (gradient of 0.3). The output, with a feature size of 256, was subsequently reduced to 128 and finally to 42 as the final output.

Supplementary Data S3 Model evaluation
Model parameters. For model comparison, rat QTOF data obtained from the NIST database (2013-06-05), Escherichia coli QTOF data obtained from PRIDE (PXD001587), and Mus musculus QTOF data obtained from PRIDE (PXD008651) were used as the evaluation database and were parsed into a format similar to that of the training database. Peptides with fewer than three peaks and peptides longer than 15 residues were removed. Peptides that were also present in the training database were removed, which left 3,709 tryptic peptides for evaluation. Modifications, such as carbamidomethylation of cysteine, were ignored.
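The filtering steps above can be illustrated as follows. This is a sketch under assumptions: the record layout (a dict with "sequence" and "peaks" keys), the function name, and the choice to define redundancy by amino acid sequence alone are hypothetical and not taken from the text.

```python
def filter_evaluation_set(peptides, training_sequences, min_peaks=3, max_len=15):
    """Keep peptides with enough peaks, within the length limit,
    and absent from the training set (matched by sequence only, an assumption)."""
    kept = []
    for peptide in peptides:
        if len(peptide["peaks"]) < min_peaks:
            continue  # fewer than three peaks
        if len(peptide["sequence"]) > max_len:
            continue  # longer than 15 residues
        if peptide["sequence"] in training_sequences:
            continue  # redundant with the training database
        kept.append(peptide)
    return kept


# Toy records for illustration only; these are not real spectra.
candidates = [
    {"sequence": "VAGAAVAK", "peaks": [0.2, 1.0, 0.5]},
    {"sequence": "AAK", "peaks": [1.0, 0.3]},
    {"sequence": "A" * 16, "peaks": [0.1, 0.4, 1.0]},
    {"sequence": "LGEYGFK", "peaks": [0.3, 1.0, 0.6, 0.1]},
]
kept = filter_evaluation_set(candidates, training_sequences={"LGEYGFK"})
print([p["sequence"] for p in kept])
```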
Evaluation was performed using the NIST rat data unless mentioned otherwise. The compared models were Prosit_2020_intensity_hcd, MS2PIP_QTOF, and MS2PIP_HCD. For the Prosit_2020_intensity_hcd model, 22 NCE values from NCE 18 to NCE 39 were tested (Supplementary Fig. S3). The CE values of the peptides, which are required as input for Prosit, were calculated using Equation S1 (Supplementary Data S1). MS2PIP was tested without the modification option for the TripleTOF 5600+ model and the HCD model.

Model comparison for simplified peptide spectrum match analysis.
To simulate a simplified peptide spectrum match analysis, we first grouped peptides from the NIST rat data by m/z similarity. Peptides with similar precursor ion m/z values (± 0.5) were grouped, which resulted in 3,658 groups with an average of 12.154 peptides per group. Within each group, we searched for peptides that contain at least three product ions with similar m/z values (± 0.5), which reduced the number of groups to 1,822 with an average of 2.548 peptides per group. The grouping was performed to simulate an MRM analysis without standards (heavy peptides), in which multiple peaks are observed for the targeted m/z transitions. Every group has a "target peptide", which is the actual peptide we want to deduce. The other peptides within the m/z ± 0.5 range of the target peptide act as noise that cannot be easily differentiated. Each model then predicts the fragmentation spectrum of every peptide in the group, and each prediction is compared with the actual fragmentation spectrum of the target peptide. This was repeated until every peptide in the group had served as the "target peptide". The similarity between each peptide in the group and the target peptide was calculated by Pearson correlation coefficient (PCC) and mean squared error (MSE), either for the whole spectrum or for the three highest-intensity peaks. To avoid bias in the MSE calculation, the predicted intensities of every model were normalized for each peptide by dividing by the maximum value, so that the maximum intensity was 1. The accuracy of each model was calculated by counting the number of instances in which the highest-scoring peptide was the target peptide.
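The scoring procedure can be sketched as below for the whole-spectrum case. The function names, the data layout, and the convention that the best MSE match is the candidate with the lowest error are assumptions made for illustration; the toy intensities are not real data.

```python
import math


def normalize(intensities):
    """Scale a spectrum so that its maximum intensity is 1."""
    peak = max(intensities)
    return [x / peak for x in intensities] if peak > 0 else list(intensities)


def pcc(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_a * var_b)


def mse(a, b):
    """Mean squared error of two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_match(observed, predictions, metric="pcc"):
    """Name of the candidate whose predicted spectrum best matches the observed one."""
    obs = normalize(observed)
    scores = {}
    for name, predicted in predictions.items():
        pred = normalize(predicted)
        # Higher is better for PCC; MSE is negated so higher is better too.
        scores[name] = pcc(obs, pred) if metric == "pcc" else -mse(obs, pred)
    return max(scores, key=scores.get)


def accuracy(groups, metric="pcc"):
    """Fraction of (target, observed, predictions) cases where the best match is the target."""
    hits = sum(best_match(obs, preds, metric) == target for target, obs, preds in groups)
    return hits / len(groups)


observed = [0.1, 1.0, 0.4, 0.0]
predictions = {
    "candidate_A": [0.2, 0.9, 0.5, 0.1],
    "candidate_B": [0.9, 0.1, 0.0, 0.6],
}
print(best_match(observed, predictions, "pcc"))
print(accuracy([("candidate_A", observed, predictions)], "mse"))
```

Restricting the comparison to the three highest-intensity peaks would require selecting three positions before scoring; how those positions were chosen is not specified, so that variant is left out of the sketch.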

Supplementary Data S4 Alternative model description.
The altered version of PrAI-frag shares the same structure as PrAI-frag. The loss function, however, was altered to place greater weight on the accuracy of the highest-intensity peaks. The mean squared error (MSE) function used in PrAI-frag estimates the error over the total output, that is, the difference across the 42 intensities per peptide.
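To make the contrast concrete, the sketch below shows a plain MSE over the 42 output intensities next to one possible way of weighting the most intense peak more heavily. The weighted variant is purely hypothetical: the text does not give the form of the altered loss, and the function name, the weight value, and the rule of up-weighting only the single most intense target position are all assumptions.

```python
def mse(pred, target):
    """Plain mean squared error over all output intensities."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def top_weighted_mse(pred, target, top_weight=5.0):
    """Hypothetical variant: the squared error at the target's most intense
    position is multiplied by top_weight; all other positions keep weight 1."""
    top = max(range(len(target)), key=target.__getitem__)
    total = sum(
        (top_weight if i == top else 1.0) * (p - t) ** 2
        for i, (p, t) in enumerate(zip(pred, target))
    )
    return total / len(target)


# Toy vectors with 42 intensities; only the most intense peak is mispredicted.
target = [0.0] * 42
target[5] = 1.0
pred = list(target)
pred[5] = 0.5

plain = mse(pred, target)
weighted = top_weighted_mse(pred, target)
print(plain, weighted)
```

With such a weighting, an error on the most intense peak contributes more to the loss than the same error elsewhere, which is the qualitative behavior the altered model is described as targeting.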